Overview

Corpus size

The corpus contains 4412 publications. This amounts to 92,970,837 total tokens (including punctuation marks) or 51,777,615 character tokens (excluding punctuation marks). The raw corpus takes up 615 MB; the formatted, tokenized and NLP-processed corpus takes up 6.1 GB. The median work in the corpus is 3994 words long, the longest work is 622,692 words long and the shortest work is 51 words long. Works with fewer than 50 tokens were excluded, as they usually had no ready-made OCR layer. The distribution of works by length is given below. In the interactive graph, hovering over a point shows the title of the work.
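The length filter and summary statistics described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the corpus: the function name `summarize_lengths` and the toy counts are hypothetical, and it treats word counts and token counts as the same unit for simplicity.

```python
from statistics import median


def summarize_lengths(word_counts, min_tokens=50):
    """Drop works shorter than min_tokens, then summarize the remaining lengths."""
    kept = [n for n in word_counts if n >= min_tokens]
    return {
        "n_works": len(kept),
        "median": median(kept),
        "shortest": min(kept),
        "longest": max(kept),
    }


# Hypothetical toy lengths; the real corpus has 4412 works.
toy_counts = [12, 51, 3994, 48, 622692, 870]
summary = summarize_lengths(toy_counts)
print(summary)
```

Here the two works below the 50-token threshold are dropped before the median, minimum and maximum are computed.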

Representativity

There are 8062 publications in ENB out of 39442 total publications in 1800-1940 (20.4%). Of these, 4412 are included in the corpus (some with multiple sources), which is 11.2% of all works published in 1800-1940. The distribution over time is depicted below. When the corpus was compiled, the remaining linked works were either not digitized in a form suitable for the corpus or were not accessible. This may change in the future, and new texts from the linked set can then be added.
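The coverage percentages can be reproduced from the three counts given in the text. The sketch below only restates that arithmetic; the constant and function names are illustrative.

```python
# Counts taken from the text above.
TOTAL_PUBLISHED = 39442  # all publications in 1800-1940
IN_ENB = 8062            # publications in ENB
IN_CORPUS = 4412         # publications included in the corpus


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


enb_share = share(IN_ENB, TOTAL_PUBLISHED)        # share of all publications in ENB
corpus_share = share(IN_CORPUS, TOTAL_PUBLISHED)  # share of all publications in the corpus
corpus_of_enb = share(IN_CORPUS, IN_ENB)          # share of the ENB set in the corpus

print(enb_share, corpus_share, corpus_of_enb)
```

The last figure is derived rather than quoted: it expresses the corpus as a share of the 8062 ENB publications instead of the full 39442.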

There are a total of 1188 unique authors represented in the corpus. 40 authors have at least 10 works in the corpus, 102 have at least 5. The number of works for the 20 most prolific authors in the dataset is given below.
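Author counts of this kind can be derived from a per-work list of author names. The following is a sketch under that assumption; the helper name and the toy author labels are hypothetical and do not reflect the corpus metadata schema.

```python
from collections import Counter


def authors_with_at_least(work_authors, k):
    """Count the authors who have at least k works in the list."""
    counts = Counter(work_authors)
    return sum(1 for n in counts.values() if n >= k)


# Hypothetical toy data: one author label per work.
toy_authors = ["A"] * 10 + ["B"] * 5 + ["C"] * 2

n_unique = len(set(toy_authors))
n_ten_plus = authors_with_at_least(toy_authors, 10)
n_five_plus = authors_with_at_least(toy_authors, 5)
top_authors = Counter(toy_authors).most_common(20)  # the 20 most prolific authors

print(n_unique, n_ten_plus, n_five_plus, top_authors)
```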

By genre, the distribution within the corpus is as follows. Most genres have 10-20% representation in the corpus. Poetry is slightly better represented, scholarly works slightly less so.

Corpus contents

The most popular places of publication for the texts in the corpus are given below.

The genre distribution within the corpus is given below.

File quality

File quality can be assessed by the proportion of words recognized by the NLP workflow (here EstNLTK 1.4). Each era has a baseline level of recognition, which rises as the orthography becomes increasingly modern; however, digitization errors or editing towards a modern standard can move a text away from that baseline. We can inspect this visually by plotting the proportion of recognized words over time. The plot shows the transition from the older writing system to a more modern one in the 1870s, with some texts lagging behind, as well as the transition from w to v, which interferes with many NLP pipelines. Variation between texts is greater during the latter transition, as both spellings remained in use for some time. Some of the variation is due to digitization errors, and the graph can be used to explore where particular texts are situated. Depending on the research question, texts with too many errors may need to be excluded from analysis.
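The recognition score can be thought of as the share of word tokens that an analyzer accepts. The sketch below uses a tiny hypothetical lexicon as a stand-in for the analyzer; it does not call EstNLTK, and the names `recognized_share` and `is_known` are illustrative.

```python
def recognized_share(tokens, is_recognized):
    """Share of word tokens (those containing a letter) accepted by is_recognized."""
    words = [t for t in tokens if any(ch.isalpha() for ch in t)]
    if not words:
        return 0.0
    return sum(1 for w in words if is_recognized(w)) / len(words)


# Hypothetical stand-in for a morphological analyzer: a small modern-spelling lexicon.
modern_lexicon = {"vana", "vesi", "ja"}


def is_known(word):
    return word.lower() in modern_lexicon


old_spelling = "wana wesi ja .".split()   # w-spelling, as in older texts
new_spelling = "vana vesi ja .".split()   # v-spelling

old_score = recognized_share(old_spelling, is_known)
new_score = recognized_share(new_spelling, is_known)
print(old_score, new_score)
```

With a modern lexicon, the w-spelled text scores lower even though it contains no digitization errors, which is why the score has to be read against the baseline of its era.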

We can also use this information to compare duplicate versions of a text, either to pick the version with the higher digitization quality or, if the score is unusually high for its era, to check whether the text has been edited.
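One way to apply this to duplicates is to keep the version with the highest recognition score and flag it when that score sits well above the baseline for its era. The sketch below assumes this rule; the function name, the `tolerance` threshold and the example scores are all hypothetical.

```python
def pick_version(versions, era_baseline, tolerance=0.1):
    """Pick the best-scoring version and flag it if it looks modernized.

    versions: list of (source_id, recognition_score) pairs for one work.
    Returns (source_id, possibly_edited).
    """
    source_id, score = max(versions, key=lambda v: v[1])
    # A score far above the era baseline may indicate editing towards modern spelling.
    possibly_edited = score > era_baseline + tolerance
    return source_id, possibly_edited


# Hypothetical scores for two works, each with two digitized versions.
plain_choice = pick_version([("scan_a", 0.62), ("scan_b", 0.71)], era_baseline=0.70)
flagged_choice = pick_version([("scan_a", 0.62), ("reprint", 0.95)], era_baseline=0.70)
print(plain_choice, flagged_choice)
```

A flagged version is not necessarily worse; it only signals that the text should be checked before being treated as a faithful copy of the original edition.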

Publications & authors on the map


Domicile


Printing locations

Figure 5. Printing locations over time


Figure 6. Top book printing location until now


Publishers

Figure 9. Top publishers in 1800-1940


Authors in different types

Education: among the people for whom details of their life trajectory are available, most attained a very high level of education for the time, usually a university degree. This is not representative of the population as a whole, but it may be quite representative of the writers in the bibliography.

The most common native dialect among the authors is the Central dialect, followed by the Tartu, Western and Võru dialects. Where this information is available, the most common domicile of the authors is Tartu, followed by Tallinn, Pärnu and St. Petersburg (Peterburi).

By profession, most authors are listed as a teacher, journalist or writer. The next most common titles are pastor, poet, translator, prose writer, playwright, linguist and politician. This likely reflects the literary landscape of the time quite accurately.

Education distribution in corpus (max level of schooling)

| edu_max_school | N |
|---|---|
| NA | 507 |
| 5 | 124 |
| 3 | 31 |
| 2 | 31 |
| 1 | 14 |
| 4 | 4 |

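Native dialect distribution in corpus

| murre (dialect) | N |
|---|---|
| NA | 358 |
| kesk (Central) | 125 |
| tartu (Tartu) | 58 |
| lääne (Western) | 46 |
| võru (Võru) | 33 |
| saarte (Insular) | 28 |
| mulgi (Mulgi) | 26 |
| ida (Eastern) | 23 |
| kirde (Northeastern) | 10 |
| ranna (Coastal) | 4 |
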
Domicile distribution in corpus (top 10 locations)

| domicile_places | N |
|---|---|
| NA | 446 |
| Tartu linn | 48 |
| Tallinn | 24 |
| Pärnu linn | 9 |
| Peterburi | 7 |
| Vaba | 6 |
| Venemaa | 6 |
| Võru linn | 6 |
| Viljandi linn | 5 |
| Rakvere linn | 4 |

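Domicile distribution in corpus by dialects

| domicile_murre (dialect area of domicile) | N |
|---|---|
| NA | 540 |
| kesk (Central) | 65 |
| tartu (Tartu) | 51 |
| võru (Võru) | 17 |
| lääne (Western) | 17 |
| ranna (Coastal) | 10 |
| ida (Eastern) | 4 |
| saarte (Insular) | 4 |
| mulgi (Mulgi) | 3 |
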
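Profession distribution in corpus (top 10 professions)

| profession_split | N |
|---|---|
| õpetaja (teacher) | 84 |
| ajakirjanik (journalist) | 81 |
| kirjanik (writer) | 75 |
| pastor (pastor) | 29 |
| luuletaja (poet) | 27 |
| tõlkija (translator) | 22 |
| prosaist (prose writer) | 22 |
| näitekirjanik (playwright) | 14 |
| keeleteadlane (linguist) | 13 |
| poliitik (politician) | 12 |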